<!--begin.rcode results='hide', echo=FALSE, message=FALSE

library(caret)
data(BloodBrain)
theme1 <- caretTheme()
theme1$superpose.symbol$col = c(rgb(1, 0, 0, .4), rgb(0, 0, 1, .4), 
  rgb(0.3984375, 0.7578125,0.6445312, .6))
theme1$superpose.symbol$pch = c(15, 16, 17)
theme1$superpose.cex = .8
theme1$superpose.line$col = c(rgb(1, 0, 0, .9), rgb(0, 0, 1, .9), rgb(0.3984375, 0.7578125,0.6445312, .6))
theme1$superpose.line$lwd <- 2
theme1$superpose.line$lty = 1:3
theme1$plot.symbol$col = c(rgb(.2, .2, .2, .4))
theme1$plot.symbol$pch = 16
theme1$plot.cex = .8
theme1$plot.line$col = c(rgb(1, 0, 0, .7))
theme1$plot.line$lwd <- 2
theme1$plot.line$lty = 1

    hook_inline = knit_hooks$get('inline')
    knit_hooks$set(inline = function(x) {
      if (is.character(x)) highr::hi_html(x) else hook_inline(x)
      })
    opts_chunk$set(comment=NA, digits = 3)

session <- paste(format(Sys.time(), "%a %b %d %Y"),
                 "using caret version",
                 packageDescription("caret")$Version,
                 "and",
                 R.Version()$version.string)
    end.rcode--> 

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
  <!--
  Design by Free CSS Templates
http://www.freecsstemplates.org
Released for free under a Creative Commons Attribution 2.5 License

Name       : Emerald 
Description: A two-column, fixed-width design with dark color scheme.
Version    : 1.0
Released   : 20120902

-->
  <html xmlns="http://www.w3.org/1999/xhtml">
  <head>
  <meta name="keywords" content="" />
  <meta name="description" content="" />
  <meta http-equiv="content-type" content="text/html; charset=utf-8" />
  <title>Feature Selection Overview</title>
  <link href='http://fonts.googleapis.com/css?family=Abel' rel='stylesheet' type='text/css'>
  <link href="style.css" rel="stylesheet" type="text/css" media="screen" />
  </head>
  <body>
  <div id="wrapper">
  <div id="header-wrapper" class="container">
  <div id="header" class="container">
  <div id="logo">
  <h1><a href="#">Feature Selection Overview</a></h1>
</div>
  <!--
  <div id="menu">
  <ul>
  <li class="current_page_item"><a href="#">Homepage</a></li>
<li><a href="#">Blog</a></li>
<li><a href="#">Photos</a></li>
<li><a href="#">About</a></li>
<li><a href="#">Contact</a></li>
</ul>
  </div>
  -->
  </div>
  <div><img src="images/img03.png" width="1000" height="40" alt="" /></div>
  </div>
  <!-- end #header -->
<div id="page">
  <div id="content">
  
<h1>Contents</h1>  
<ul>
  <li><a href="#builtin">Models with Built-In Feature Selection</a></li>
  <li><a href="#methods">Feature Selection Methods</a></li> 
  <li><a href="#external">External Validation</a></li> 
 </ul>   

<div id="builtin"></div>   
<h1>Models with Built-In Feature Selection</h1>

<!--begin.rcode tags, results='hide', echo=FALSE, message=FALSE
mods <- getModelInfo()
ifs <- unlist(lapply(mods, function(x) any(grepl("Implicit", x$tags))))
ifs <- paste("<code>", names(ifs)[ifs], "</code>", sep = "", collapse = ", ")
    end.rcode--> 

<p>
Many models that can be accessed using <a href="http://cran.r-project.org/web/packages/caret/index.html"><strong>caret</strong></a>'s <span class="mx funCall">train</span> function produce prediction equations that do not necessarily use all the predictors. These models are thought to have built-in feature selection: <code>ada</code>, <code>bagEarth</code>, <code>bagFDA</code>, <code>bstLs</code>, <code>bstSm</code>, <code>C5.0</code>, <code>C5.0Cost</code>, <code>C5.0Rules</code>, <code>C5.0Tree</code>, <code>cforest</code>, <code>ctree</code>, <code>ctree2</code>, <code>cubist</code>, <code>earth</code>, <code>enet</code>, <code>evtree</code>, <code>extraTrees</code>, <code>fda</code>, <code>gamboost</code>, <code>gbm</code>, <code>gcvEarth</code>, <code>glmnet</code>, <code>glmStepAIC</code>, <code>J48</code>, <code>JRip</code>, <code>lars</code>, <code>lars2</code>, <code>lasso</code>, <code>LMT</code>, <code>LogitBoost</code>, <code>M5</code>, <code>M5Rules</code>, <code>nodeHarvest</code>, <code>oblique.tree</code>, <code>OneR</code>, <code>ORFlog</code>, <code>ORFpls</code>, <code>ORFridge</code>, <code>ORFsvm</code>, <code>pam</code>, <code>parRF</code>, <code>PART</code>, <code>penalized</code>, <code>PenalizedLDA</code>, <code>qrf</code>, <code>relaxo</code>, <code>rf</code>, <code>rFerns</code>, <code>rpart</code>, <code>rpart2</code>, <code>rpartCost</code>, <code>RRF</code>, <code>RRFglobal</code>, <code>smda</code>, <code>sparseLDA</code>. Many of the functions have an ancillary method called <span class="mx funCall">predictors</span> that returns a vector indicating which predictors were used in the final model.
</p>
<p>
In many cases, using these models with built-in feature selection will be more efficient than algorithms where the search routine for the right predictors is external to the model. Built-in feature selection typically couples the predictor search algorithm with the parameter  estimation and are usually optimized with a single objective function (e.g. error rates or likelihood). 
</p>

<div id="methods"></div>   
<h1>Feature Selection Methods</h1>

<p>
Apart from models with built-in feature selection, most approaches
for reducing the number of predictors can
be placed into two main categories. Using the terminology of
<a href="http://scholar.google.com/scholar?q=%22Irrelevant+Features+and+the+Subset+Selection+Problem">John, Kohavi, and Pfleger (1994)</a>:
</p>
<ul>
  <li><i>Wrapper</i> methods 
  evaluate multiple models using procedures that add and/or remove
  predictors to find the optimal combination that maximizes model
  performance. In essence, wrapper methods are search algorithms that treat
  the predictors as the inputs and utilize model performance as the output to be optimized.
  <strong>caret</strong> has wrapper methods based on <a href="rfe.html">recursive feature elimination</a>, <a href="ga.html">genetic algorithms</a>, and <a href="sa.html">simulated annealing</a>. 
  </li>
  <li><i>Filter</i> methods
  evaluate the relevance of the predictors outside of the predictive
  models and subsequently model only the predictors that pass some
  criterion. For example, for classification problems, each predictor
  could be individually evaluated to check if there is a
  plausible relationship between it and the observed classes.
  Only predictors with important relationships would then be included in a classification
  model. <a href="http://scholar.google.com/scholar?q=%22A+review+of+feature+selection+techniques+in+bioinformatics">Saeys,  Inza, and Larranaga (2007)</a> surveys filter methods.
  <strong>caret</strong> has a general framework for using <a href="filters.html">univariate filters</a>.
  </li>
</ul>
<p>
Both approaches have advantages and drawbacks. Filter methods are
usually more computationally efficient than wrapper methods, but
the selection criterion is not directly related to the effectiveness of the
model. Also, most filter methods evaluate each predictor separately
and, consequently, redundant (i.e. highly-correlated) predictors
may be selected and important interactions between variables will not be able to be
quantified. The downside of the wrapper method is that many models are
evaluated (which may also require parameter tuning) and thus an increase in
computation time. There is also an increased risk of over-fitting with wrappers.
<p>


<div id="external"></div>   
<h1>External Validation</h1>

<p>
It is important to realize that feature selection is part of the model building process and, as such, should be externally validated. Just as parameter tuning can result in over-fitting, feature selection can over-fit to the predictors (especially when search wrappers are used). In each of the <strong>caret</strong> functions for feature selection, the selection process is included in any resampling loops. See  
</p>
<p>
See <a href="http://scholar.google.com/scholar?q=%22Selection+bias+in+gene+extraction+on+the+basis+of+microarray+gene-expression+data">Ambroise and McLachlan (2002)</a> for a demonstration of this issue.
</p>


<!-- ------------------------ end #content------------------------ -->

<div style="clear: both;">&nbsp;</div>
  </div>
  <!-- end #content -->
<div id="sidebar">
<ul>
  <li>
    <h2>General Topics</h2>
    <ul>
      <li><a href="index.html">Front Page</a></li>
      <li><a href="visualizations.html">Visualizations</a></li>
      <li><a href="preprocess.html">Pre-Processing</a><li>
      <li><a href="splitting.html">Data Splitting</a></li>
      <li><a href="varimp.html">Variable Importance</a></li>
      <li><a href="other.html">Model Performance</a></li>
      <li><a href="parallel.html">Parallel Processing</a></li>
    </ul>
    <h2>Model Training and Tuning</h2>
    <ul>
      <li><a href="training.html">Basic Syntax</a></li>
      <li><a href="modelList.html">Sortable Model List</a></li>
      <li><a href="bytag.html">Models By Tag</a></li>
      <li><a href="similarity.html">Models By Similarity</a></li>
      <li><a href="custom_models.html">Using Custom Models</a></li>
      <li><a href="sampling.html">Sampling for Class Imbalances</a></li> 
      <li><a href="random.html">Random Search</a></li> 
      <li><a href="adaptive.html">Adaptive Resampling</a></li> 
    </ul>
    <h2>Feature Selection</h2>
    <ul>
      <li><a href="featureselection.html">Overview</a>
      <li><a href="rfe.html">RFE</a></li>
      <li><a href="filters.html">Filters</a></li>
      <li><a href="GA.html">GA</a></li>
      <li><a href="SA.html">SA</a></li>
    </ul>  
  </li>
</ul>
</div>
<!-- end #sidebar -->
<div style="clear: both;">&nbsp;</div>
  </div>
  <div class="container"><img src="images/img03.png" width="1000" height="40" alt="" /></div>
  <!-- end #page -->
</div>
  <div id="footer-content"></div>
  <div id="footer">
<!--begin.rcode echo = FALSE
knit_hooks$set(inline = hook_inline)    
    end.rcode-->   
  <p>Created on <!--rinline I(session) -->.</p>
  </div>
  <!-- end #footer -->
</body>
  </html>
